Word Image Matching as a Techique for Degraded Text Recognition

نویسندگان

  • Jonathan J. Hull
  • Siamak Khoubyari
  • Tin Kam Ho
چکیده

A technique is presented that determines equivalences between word images in a passage of text. A clustering procedure is applied to group visually similar words. Initial hypotheses for the identities of words are then generated by matching the word groups to language statistics that predict the frequency at which certain words will occur. This is followed by a recognition step that assigns identifications to the images in the clusters. This paper concentrates on the clustering algorithm. A clustering technique is presented and its performance on a running text of 1062 word images is determined. It is shown that the clustering algorithm can correctly locate groups of short function words with better than a 95 percent correct rate.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Word Image Matching in a Methodology for Degraded Text Recognition

A technique for the use of global context in text recognition is presented that determines equivalences between word images in a passage of text. Initial hypotheses for the identities of words are then generated by matching the word groups to language statistics that predict the frequency at which certain words will occur. This is followed by a recognition step and a relaxation-based control st...

متن کامل

Integration of Visual Inter-Word Constraints and Linguistic Knowledge in Degraded Text Recognition

Degraded text recognition is a di cult task. Given a noisy text image, a word recognizer can be applied to generate several candidates for each word image. Highlevel knowledge sources can then be used to select a decision from the candidate set for each word image. In this paper, we propose that visual inter-word constraints can be used to facilitate candidate selection. Visual inter-word const...

متن کامل

Prototype Extraction and Adaptive OCR

ÐTo maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a ...

متن کامل

Line and Word Matching in Old Documents

This paper is concerned with the problem of establishing an index based on word matching. It is assumed that the book was digitised as better as possible and some pre-processing techniques were already applied as line orientation correction and some noise removal. However two main factor are responsible for being not possible to apply ordinary optical character recognition techniques (OCR): the...

متن کامل

RODRIGUEZ-SERRANO, PERRONNIN: LABEL EMBEDDING FOR TEXT RECOGNITION 1 Label embedding for text recognition

The standard approach to recognizing text in images consists in first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields (CRF). This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to b...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998